System Software MISC – Perf & PMU/HPM

14th Ironman Contest

脆脆

2022-10-10 23:30:58


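A benchmark lets us use execution time to compare how the same program performs on different CPUs. To analyze the performance of a single program, however, the total execution time alone does not give us enough information to find the bottleneck. Hardware dedicated to performance measurement can supply that information: the PMU/HPM.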

What is PMU/HPM

Basic Introduction

PMU: Performance Monitoring Unit
HPM: Hardware Performance Monitor
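A PMU is a piece of hardware that counts the hardware events occurring on the hardware platform. Software can use these event counts to analyze its own performance while it runs.
Hardware event: something that happens in hardware, for example the execution of one branch instruction.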

The Component of PMU

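Event interface: the hardware that is the source of an event sends a signal to the PMU when the event occurs, and the PMU has an interface that receives these hardware event signals.
Event selector: the PMU can accept more than one kind of hardware event, so software has to tell the PMU which hardware event to count.
PMC (Performance Monitoring Counter): the counter inside the PMU that counts event occurrences. Each time the event interface receives the hardware event chosen by the event selector, the PMC increments its recorded value.
- The value recorded in a PMC can be read or modified.
- When a PMC overflows, the PMU raises an interrupt to notify the CPU.
A PMU contains multiple PMCs, each paired with its own event selector, so the PMU can count several kinds of events at the same time.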
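To make the relationship between these components concrete, here is a minimal software model of a PMU. It is only a sketch: the class names (`ToyPMU`, `Pmc`), the event names, and the 8-bit counter width are assumptions chosen for illustration, not the interface of any real PMU.

```python
from dataclasses import dataclass
from typing import Callable, Optional

COUNTER_BITS = 8                       # hypothetical width, small so overflow is easy to see
COUNTER_MAX = (1 << COUNTER_BITS) - 1


@dataclass
class Pmc:
    """One PMC together with its event selector."""
    selected_event: Optional[str] = None   # written by software
    value: int = 0


class ToyPMU:
    """Hypothetical PMU model: several PMCs, each with its own selector."""

    def __init__(self, num_counters: int, on_overflow: Callable[[int], None]):
        self.counters = [Pmc() for _ in range(num_counters)]
        self.on_overflow = on_overflow     # stands in for the interrupt line

    def select(self, index: int, event: str) -> None:
        # Event selector: software chooses what this PMC counts.
        self.counters[index].selected_event = event

    def write(self, index: int, value: int) -> None:
        # PMC values can be modified by software.
        self.counters[index].value = value & COUNTER_MAX

    def read(self, index: int) -> int:
        # PMC values can be read by software.
        return self.counters[index].value

    def signal(self, event: str) -> None:
        # Event interface: hardware reports one occurrence of `event`.
        for index, pmc in enumerate(self.counters):
            if pmc.selected_event != event:
                continue
            if pmc.value == COUNTER_MAX:
                pmc.value = 0              # wrap around
                self.on_overflow(index)    # "interrupt" the CPU
            else:
                pmc.value += 1


overflows = []
pmu = ToyPMU(num_counters=2, on_overflow=overflows.append)
pmu.select(0, "branch")
pmu.select(1, "cache_miss")
pmu.write(0, COUNTER_MAX - 2)              # preload so PMC 0 wraps soon

for _ in range(3):
    pmu.signal("branch")
pmu.signal("cache_miss")
pmu.signal("instruction")                  # no selector chose this event, so it is ignored

print(pmu.read(0), pmu.read(1), overflows)
```

In this model the two PMCs count different events at the same time, and only the preloaded one reports an overflow.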

The Application of PMU/HPM - Perf

What is Perf

Perf is the system utility provided with the Linux kernel for analyzing program performance. It uses the PMU (which provides hardware event counters) and the Linux kernel (which provides software event counters) to report how many events occur in each function while a program runs.

Hardware event counters: CPU cycle count, instruction count, cache miss count, interrupt count, branch misprediction count, and so on.
Software event counters: system call count, scheduler event count, context switch count, and so on.

The Modes of Perf

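Stat: measures the number of events that occur between the start and the end of a program's execution.
Perf reads the PMU's PMCs when the program starts and reads them again when the program finishes. The difference between the two sets of PMC values is the number of events that occurred during execution.
For example: before the program starts, the first event selector is set to instruction count and its PMC is cleared, so the recorded initial value is 0. If the instruction count PMC reads 20000 when the program ends, the program executed 20000 instructions.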
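The before/after arithmetic of stat mode can be sketched in a few lines. This is an illustrative model, not perf's implementation: `ToyCounter`, `stat`, and the 48-bit width are hypothetical. The masked subtraction is an addition to the example above; it shows that the difference still comes out right when the counter does not start at zero and wraps once during the run.

```python
COUNTER_BITS = 48                          # hypothetical counter width
COUNTER_MASK = (1 << COUNTER_BITS) - 1


class ToyCounter:
    """Stand-in for a PMC whose event selector is set to instruction count."""

    def __init__(self, value=0):
        self.value = value & COUNTER_MASK

    def retire(self, count=1):
        # Real hardware would do this for every retired instruction.
        self.value = (self.value + count) & COUNTER_MASK


def stat(counter, workload):
    """Read the counter before and after the workload and return the difference."""
    start = counter.value
    workload(counter)
    end = counter.value
    # Masking keeps the result correct if the counter wrapped once in between.
    return (end - start) & COUNTER_MASK


def program(counter):
    # Hypothetical workload that executes 20000 instructions.
    for _ in range(20_000):
        counter.retire()


from_zero = stat(ToyCounter(0), program)
near_top = stat(ToyCounter(COUNTER_MASK - 5), program)
print(from_zero, near_top)
```

On a real Linux system, this kind of whole-run count is what `perf stat -e instructions <command>` reports.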

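Record: samples the program to find out in which functions it spends the most time, and why.

How it works: perf relies on the interrupt that a PMC raises when it overflows. When the interrupt fires, perf records the PMC values and the current PC (program counter), and uses them to estimate the performance of each function.

For example: we want to observe the program's cache miss count and estimate which functions the cache misses tend to occur in, using 10000 CPU cycles as the sampling period.

When the program starts, set the first event selector to CPU cycles and the second event selector to cache miss count, and set the first PMC to MAX_VALUE - 10000.
After 10000 CPU cycles have elapsed, the PMU raises an overflow interrupt to the CPU, and perf records the current PC and the cache miss count held in the second PMC.
Reset the first PMC to MAX_VALUE - 10000 and let the program continue.
This gives us the cache miss count for every 10000 CPU cycles, how the cache misses are distributed across functions, which functions take the most time, and so on.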
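The sampling loop above can be simulated to see what kind of data it produces. The following is a rough sketch under several assumptions: the trace, the function names (`parse`, `lookup`, `emit`), and the 32-bit counter width are invented for illustration, and each trace entry stands for one CPU cycle. The preload is MAX_VALUE - period + 1 here so that the counter wraps on exactly the 10000th cycle; whether that extra 1 is needed depends on how a given PMU defines overflow.

```python
from collections import Counter

COUNTER_BITS = 32                          # hypothetical counter width
COUNTER_MAX = (1 << COUNTER_BITS) - 1
PERIOD = 10_000                            # sampling period in CPU cycles


def make_trace():
    """Hypothetical program: one (function at PC, cache miss?) entry per cycle."""
    trace = [("parse", False)] * 20_000
    trace += [("lookup", i % 100 == 0) for i in range(70_000)]
    trace += [("emit", False)] * 10_000
    return trace


def record(trace, period=PERIOD):
    reload_value = COUNTER_MAX - period + 1  # wraps after exactly `period` cycles
    cycle_pmc = reload_value                 # first PMC: CPU cycles
    miss_pmc = 0                             # second PMC: cache misses
    last_miss = 0
    samples = []
    for function, is_miss in trace:
        if is_miss:
            miss_pmc += 1
        cycle_pmc = (cycle_pmc + 1) & COUNTER_MAX
        if cycle_pmc == 0:                   # overflow: the "interrupt" handler runs
            samples.append((function, miss_pmc - last_miss))
            last_miss = miss_pmc
            cycle_pmc = reload_value         # re-arm for the next period
    return samples


samples = record(make_trace())

# Number of samples per function approximates where time is spent.
time_by_function = Counter(function for function, _ in samples)

# Misses seen since the previous sample are credited to the sampled function.
misses_by_function = Counter()
for function, misses in samples:
    misses_by_function[function] += misses

print(time_by_function)
print(misses_by_function)
```

Crediting every miss in a period to the function that happened to be running at the interrupt is an approximation, which is why the text speaks of estimating rather than measuring per-function behavior. A shorter period gives a finer attribution at the cost of more interrupts.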